By the end of this lesson, you will be able to:
Data visualization is the representation of data through the use of common graphics, such as charts, plots, infographics, and even animations. These visual displays communicate complex data relationships and data-driven insights in a way that is easy to understand. This technique is mainly used for
For more details see [https://r-coder.com/plot-r/]
Syntax
plot(x, y, ...)
- The following arguments are optional:
for a dot plot: type = 'p' (default)
for a line chart: type = 'l'
to assign a plot title: main = "title", a character string
xlab = "Name of x variable", a character string
ylab = "Name of y variable", a character string
xlim = limits of the x values, a numeric range
ylim = limits of the y values, a numeric range
plot(0,1, type = 'n')
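The optional arguments listed above can be combined in a single call. A minimal sketch, assuming made-up data (the vectors `x_demo` and `y_demo` are illustrative and not part of the lesson's data sets):

```r
# Illustrative data (hypothetical, not from the lesson)
x_demo <- 1:10
y_demo <- x_demo^2

# Line chart with a title, axis labels, and explicit axis limits
plot(x_demo, y_demo,
     type = "l",
     main = "y = x^2",
     xlab = "x",
     ylab = "y",
     xlim = c(0, 10),
     ylim = c(0, 100))

# Overlay the same data as filled points on the existing plot
points(x_demo, y_demo, pch = 19)
```

Calling points() after plot() adds to the current figure instead of starting a new one.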
A scatter chart (or a scatter plot) is a chart that shows the relationship between two quantitative variables.
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
x = iris$Sepal.Length
y= iris$Sepal.Width
plot(x, y)
x = iris$Sepal.Length
y= iris$Sepal.Width
plot(x, y, type = 'p', xlim = range(x), ylim = range(y),
xlab = "Sepal.Length",
ylab = "Sepal.Width",
main = "Association of Sepal.Length and Sepal.Width of iris data")
For more details see [http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf]
x = iris$Sepal.Length
y= iris$Sepal.Width
plot(x, y, col = 'red')
For more details see here [https://www.r-bloggers.com/2021/06/r-plot-pch-symbols-different-point-shapes-in-r/]
x = iris$Sepal.Length
y= iris$Sepal.Width
plot(x, y, col = 'red', pch = 19)
x = iris$Sepal.Length
y= iris$Sepal.Width
plot(x, y, col = iris$Species, pch = 10,
main = "Color by Species")
x = iris$Sepal.Length
y= iris$Sepal.Width
y1 = iris$Petal.Width
plot(x, y, ylim = range(y, y1), col = 'red', pch = 10)
points(x, y1, col = 'blue', pch = 20)
In the previous example we observed the association of Sepal.Width and Petal.Width with x-variable Sepal.Length. Now let's observe the association of those y-variables with x-variables Sepal.Length and Petal.Length.
It is very easy to combine multiple plots into one overall graph in R using par(mfrow = c(i, j)).
par(mfrow = c(i, j)) arranges the plots in a grid, where i is the number of rows and j is the number of columns.
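A minimal sketch of this layout mechanism, assuming the built-in iris data. Saving the value returned by par() and restoring it afterwards is a common idiom, so that later plots are not affected by the grid setting:

```r
# par() returns the previous settings, so they can be restored later
old_par <- par(mfrow = c(1, 2))   # 1 row, 2 columns

plot(iris$Sepal.Length, iris$Sepal.Width,
     xlab = "Sepal.Length", ylab = "Sepal.Width")
plot(iris$Petal.Length, iris$Petal.Width,
     xlab = "Petal.Length", ylab = "Petal.Width")

# Restore the previous layout and read it back
par(old_par)
layout_after <- par("mfrow")
```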
par(mfrow = c(2, 1))
# plot 1: sepal measurements
x1 = iris$Sepal.Length
y1 = iris$Sepal.Width
# plot 2: petal measurements
x2 = iris$Petal.Length
y2 = iris$Petal.Width
plot(x1, y1, xlab = "Sepal.Length", ylab = "Sepal.Width", col = 'red', pch = 19)
plot(x2, y2, xlab = "Petal.Length", ylab = "Petal.Width", col = 'green', pch = 20)
par(mfrow = c(2, 2))
#plot 1
for (j in 2:4){
plot(iris$Sepal.Length, iris[,j], ylim = range(iris[,2:4]), xlab = "Sepal.Length",
ylab = names(iris)[j], col = iris$Species, pch = 20)
}
It should be noted that in RStudio the graph is displayed in the pane layout, and the figure size can be adjusted in an R chunk by assigning values to fig.width and fig.height.
# Figure size in RStudio
```{r, fig.width = 4, fig.height = 3}
x = rnorm(20)
y = 2*x + 1
plot(x, y)
```
options(repr.plot.width=10, repr.plot.height=5) # figure size for notebook output
par(mfrow = c(1, 2))
#plot 1
x1 = iris$Sepal.Length
y1= iris$Sepal.Width
y2 = iris$Petal.Width
plot(x1, y1, ylim = range(c(y1, y2)), col = 'red', pch = 18)
points(x1, y2, col = 'blue', pch = 20)
#plot 2
x2= iris$Petal.Length
y3 = iris$Sepal.Width
y4 = iris$Petal.Width
plot(x2, y3, ylim = range(c(y3, y4)), col = 'red', pch = 18)
points(x2, y4, col = 'blue', pch = 20)
?par()
We can change the parameters mai, mar, tcl. Type help(par) in R-console for more details.
```
mai: A numerical vector of the form c(bottom, left, top, right) which gives the margin size specified in inches.
mar: A numerical vector of the form c(bottom, left, top, right) which gives the number of lines of margin to be specified on the four sides of the plot. The default is c(5, 4, 4, 2) + 0.1.
```
options(repr.plot.width=10, repr.plot.height=6)
par(mfrow = c(2, 2), tcl=-0.01, mai=c(0.5,0.5,0.5,0.5))
#plot 1
x = iris$Sepal.Length
y1= iris$Sepal.Width
y2 = iris$Petal.Width
plot(x, y1, ylim = range(y1), xlab = "Sepal.Length",
ylab = "Sepal.Width", col = "black", pch = 18)
#plot2
plot(x, y2, ylim = range(y2), xlab = "Sepal.Length",
ylab = "Petal.Width", col = 'blue', pch = 18)
#plot 3 and 4
x1 = iris$Petal.Length
y3= iris$Sepal.Width
y4 = iris$Petal.Width
plot(x1, y3, ylim = range(y3), xlab = "Petal.Length",
ylab = "Sepal.Width", col = "red", pch = 18)
#plot4
plot(x1, y4, ylim = range(y4), xlab = "Petal.Length",
ylab = "Petal.Width", col = 'green', pch = 18)
x1 = iris$Sepal.Length
x2 = iris$Petal.Length
idx = 1: length(x1)
# Without ylim, the y-axis covers only the range of x1, so part of x2 is cut off
plot(idx, x1, type = "l", xlab = "", ylab = "", col = 'red', lty = 1,
main = "Sepal.Length vs Petal.Length comparison")
lines(idx, x2, lty = 2, col = 'blue')
x1 = iris$Sepal.Length
x2 = iris$Petal.Length
idx = 1: length(x1)
plot(idx, x1, type = "l", xlab = "", ylab = "", ylim = range(x1, x2),
lty = 1, col = 'red', main = "Sepal.Length vs Petal.Length comparison")
lines(idx, x2, lty = 2, col = 'blue')
When we compare multiple variables in a trace plot or scatter plot, it is very hard to tell which line or set of points belongs to which variable. Adding a legend is therefore important in such cases.
For more details see [https://r-coder.com/add-legend-r/]
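Legends work the same way for scatter plots, with pch in place of lty. A sketch assuming the iris data; the color codes 1 to 3 rely on R's default palette and on the order of the factor levels:

```r
species <- iris$Species
codes <- as.integer(species)        # 1, 2, 3 for the three species

plot(iris$Sepal.Length, iris$Sepal.Width,
     col = codes, pch = 19,
     xlab = "Sepal.Length", ylab = "Sepal.Width")

legend("topright",
       legend = levels(species),            # one entry per species
       col = seq_along(levels(species)),    # same codes as in plot()
       pch = 19,
       title = "Species")
```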
x1 = iris$Sepal.Length
x2 = iris$Petal.Length
idx = 1: length(x1)
plot(idx, x1, type = "l", xlab = "", ylab = "", ylim = range(x1, x2),
lty = 1, col = 'red', main = "Sepal.Length vs Petal.Length comparison")
lines(idx, x2, lty = 2, col = 'blue')
legend(x = "topleft", # Position
legend = c("Sepal.Length", "Petal.Length"), # Legend texts
lty = c(1, 2), # Line types
col = c('red', 'blue'), # Line colors
lwd = 2) # Line width
x1 = iris$Sepal.Length
x2 = iris$Petal.Length
idx = 1: length(x1)
plot(idx, x1, type = "l", xlab = "", ylab = "", ylim = range(x1, x2),
lty = 1, col = 'red', main = "Sepal.Length vs Petal.Length comparison")
lines(idx, x2, lty = 2, col = 'blue')
legend(x = "topright", # Position
legend = c("Sepal.Length", "Petal.Length"), # Legend texts
inset = c(0, 0),
lty = c(1, 2), # Line types
col = c('red', 'blue'), # Line colors
lwd = 2)
A bar plot is a chart or graph that presents categorical data with rectangular bars with heights or lengths proportional to their corresponding values (or count). The bars can be plotted vertically or horizontally.
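One way to obtain the counts for such a chart is base R's table() function. A minimal sketch, assuming the built-in mtcars data:

```r
# Count cars per number of cylinders
cyl_counts <- table(mtcars$cyl)

# Vertical bars (default), then horizontal bars with horiz = TRUE
barplot(cyl_counts, xlab = "cyl", ylab = "count")
barplot(cyl_counts, horiz = TRUE, xlab = "count", ylab = "cyl")
```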
str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
library(dplyr);
car_counts_by_cyl = mtcars %>%
group_by(cyl) %>%
summarise(count = n())
car_counts_by_cyl
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
| cyl | count |
|---|---|
| <dbl> | <int> |
| 4 | 11 |
| 6 | 7 |
| 8 | 14 |
values = car_counts_by_cyl$count
cyl =car_counts_by_cyl$cyl
#help(barplot)
# One row, two columns
par(mfrow = c(1, 2))
# Absolute frequency barplot
barplot(height = values, names.arg = cyl, xlab = "cyl",
main = "Absolute frequency",
col = rainbow(3))
# Relative frequency barplot
barplot(height = prop.table(values)*100, names.arg = cyl,
xlab = "cyl", main = "Relative frequency (%)",
col = rainbow(3))
Boston311_2023_data =
read.csv("https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/e6013a93-1321-4f2a-bf91-8d8a02f1e62f/download/tmp518q5snq.csv")
library(stringr)
library(dplyr)
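# Flag cases whose title mentions "Parking Enforcement"
Boston311_2023_data$Parking_Enforcement_status <- str_detect(Boston311_2023_data$case_title,
regex("\\bParking Enforcement\\b"))
# Note: the flag is not applied below, so n() counts all cases per neighborhood;
# add filter(Parking_Enforcement_status) before group_by() to count only flagged cases
Parking_Enforcement_by_nbd <- Boston311_2023_data %>%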
group_by(neighborhood) %>%
summarise(nbd_count_Parking_Enforcement = n()) %>%
arrange(desc(nbd_count_Parking_Enforcement))
head(Parking_Enforcement_by_nbd, 10)
| neighborhood | nbd_count_Parking_Enforcement |
|---|---|
| <chr> | <int> |
| Dorchester | 36272 |
| Roxbury | 21426 |
| South Boston / South Boston Waterfront | 18835 |
| Allston / Brighton | 18490 |
| East Boston | 17862 |
| South End | 15265 |
| Jamaica Plain | 13728 |
| Downtown / Financial District | 11526 |
| Greater Mattapan | 11191 |
| Back Bay | 10559 |
top_10_nbd = Parking_Enforcement_by_nbd[1:10, ]
barplot(names.arg = top_10_nbd$neighborhood, height = top_10_nbd$nbd_count_Parking_Enforcement,
col = rainbow(10), las = 2)
# las = 1: axis labels are always horizontal
# las = 2: axis labels are always perpendicular to the axis (vertical on the x-axis)
# General form: par(mar = ..., mgp = ..., las = ...); the call below sets the default values
par(mar=c(5.1, 4.1, 4.1, 2.1), mgp=c(3, 1, 0), las=0)
par sets or adjusts plotting parameters. Here we consider the following three parameters: margin size (mar), axis label locations (mgp), and axis label orientation (las).
mar – A numeric vector of length 4, which sets the margin sizes in the following order: bottom, left, top, and right. The default is c(5.1, 4.1, 4.1, 2.1).
mgp – A numeric vector of length 3, which sets the axis label locations relative to the edge of the inner plot window. The first value represents the location of the axis labels (i.e. xlab and ylab in plot), the second the tick-mark labels, and the third the tick marks. The default is c(3, 1, 0).
las – A numeric value indicating the orientation of the tick mark labels and any other text added to a plot after its initialization. The options are as follows: always parallel to the axis (the default, 0), always horizontal (1), always perpendicular to the axis (2), and always vertical (3).
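A sketch of how these three parameters work together, assuming made-up category names (the `counts` vector is illustrative only). A wider left margin leaves room for long horizontal labels:

```r
# Hypothetical counts with long category names
counts <- c("A fairly long category name" = 12,
            "Another long category name" = 7,
            "Short" = 3)

old_par <- par(mar = c(4, 12, 2, 1),   # bottom, left, top, right (in lines)
               mgp = c(2.5, 0.8, 0),   # axis title, tick labels, axis line
               las = 1)                # tick labels always horizontal

barplot(counts, horiz = TRUE, xlab = "count")
settings_used <- par("mar", "las")     # read back the active values

par(old_par)                           # restore the previous settings
```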
### Horizontal barplot
par(mar = c(4, 16, 2, 2))
top_10_nbd = Parking_Enforcement_by_nbd[1:10, ]
barplot(names = top_10_nbd$neighborhood, height = top_10_nbd$nbd_count_Parking_Enforcement,
col = rainbow(10), horiz = TRUE, las = 1)
### Barplot for continuous variable
var1 = iris$Sepal.Length
cut_off = c(0, 5, 6, 7 , 8)
catgory = c("low", "low_mid", "high_mid", "high")
Sepal_Len_cat1 = cut(var1, breaks = cut_off, include.lowest = TRUE, right = FALSE, labels = catgory)
iris_new = cbind(iris, Sepal_Len_cat1)
barplot(table(iris_new$Sepal_Len_cat1), col = rainbow(4),
legend.text = levels(iris_new$Sepal_Len_cat1))# With Legend
# Variable am to factor
am = mtcars$am
am <- factor(am)
# Change factor levels
levels(am) <- c("Automatic", "Manual")
summary_data <- tapply(mtcars$hp, list(cylinders = mtcars$cyl,
transmission = am),FUN = mean, na.rm = TRUE)
summary_data
| | Automatic | Manual |
|---|---|---|
| 4 | 84.66667 | 81.8750 |
| 6 | 115.25000 | 131.6667 |
| 8 | 194.16667 | 299.5000 |
barplot(summary_data, xlab = "Transmission type",
main = "Horsepower mean",
col = rainbow(3),
beside = TRUE,
legend.text = rownames(summary_data),
args.legend = list(title = "Cylinders", x = "topright",
inset = c(-0.20, 0)))
par(mar = c(5, 5, 4, 10), las = 0)
barplot(summary_data,
main = "Horsepower mean",
xlab = "Transmission type", ylab = "HP mean",
col = c('red', 'blue', 'green'),
legend.text = rownames(summary_data),
beside = FALSE, # Stacked bars (default)
args.legend = list(title = "Cylinders", x = "topright",
inset = c(-0.2, 0)))
A pie chart represents data as numerical proportions of a whole. In R, a pie chart is created with the pie() function.
# cyl-wise distribution of data using pie-chart
count_cars <- mtcars %>%
group_by(cyl) %>%
summarise(count = n())
For hcl.colors see [https://blog.r-project.org/2019/04/01/hcl-based-color-palettes-in-grdevices/]
car_type <- paste(count_cars$cyl, "cyl")
count <- count_cars$count
# calculate the percentage share of each group
perc <- round(count/sum(count)* 100, 2)
# combine the car type and its percentage to create the labels
labels <- paste(car_type, perc,'%')
pie(count, labels = labels, radius = 1, col = hcl.colors(n = 3, palette = 'Spectral'),
border = 'gray', main = "Pie chart in R")
a. Horizontal line in R
# plot() sets up the plotting region;
# type = "n" draws the axes but no points
plot(1, type = 'n', # n stands for no points
xlab = "", #no x label
ylab = "", #no y label
xlim = c(0, 5), # x limit
ylim = c(0, 5) # y limit
)
plot(1, type = 'n', # n stands for no points
xlab = "", #no x label
ylab = "", #no y label
xlim = c(0, 5), # x limit
ylim = c(0, 5) # y limit
)
abline(h = 2.5, col = "red")
b. Vertical line in R
# plot() sets up the plotting region;
# type = "n" draws the axes but no points
plot(1, type = "n", xlab = "",
ylab = "", xlim = c(0, 5),
ylim = c(0, 5))
abline(v = 2, col = 'red')
c. Horizontal and vertical line in R
# plot() sets up the plotting region;
# type = "n" draws the axes but no points
plot(1, type = "n", xlab = "",
ylab = "", xlim = c(0, 5),
ylim = c(0, 5))
abline(h = 2.5, v = 2, col = 'red')
d. Line with slope and intercept in R
# plot() sets up the plotting region;
# type = "n" draws the axes but no points
plot(1, type = "n", xlab = "",
ylab = "", xlim = c(0, 5),
ylim = c(0, 5))
abline(a = 0, # Intercept
b = 1, col = 'red') # Slope
abline(a = 5, # Intercept
b = -1, col = 'blue') # Slope
A histogram is the most widely used graph for representing quantitative (numerical) data, mostly data that are continuous in nature.
Syntax
hist(x, ...)
hist(x, breaks = "Sturges",
freq = NULL, probability = !freq,
include.lowest = TRUE, right = TRUE,
density = NULL, angle = 45, col = NULL, border = NULL,
main = paste("Histogram of" , xname),
xlim = range(breaks), ylim = NULL,
xlab = xname, ylab,
axes = TRUE, plot = TRUE, labels = FALSE,
nclass = NULL, warn.unused = TRUE, ...)
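hist() also returns the binning it computed, which can be inspected without drawing anything. A sketch assuming the iris data; with the default Sturges rule the exact breaks depend on the data:

```r
# Compute the histogram without plotting it
h <- hist(iris$Sepal.Length, plot = FALSE)

h$breaks   # bin boundaries
h$counts   # number of observations per bin
h$density  # relative frequency divided by bin width

# The densities multiplied by the bin widths sum to 1
total_area <- sum(h$density * diff(h$breaks))
```

The density component is what is drawn on the y-axis when probability = TRUE is used.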
hist(iris$Sepal.Length, breaks = 20, col = 'gray', probability = TRUE)
hist(iris$Sepal.Length, breaks = 15, xlab = 'Sepal.Length',
ylab = 'Relative Frequency',probability = TRUE, col = 'gray',
main = "Histogram of Sepal.Length of Iris data")
par(mfrow = c(2, 2))
x <- iris$Sepal.Length # First group
y <- iris$Petal.Length # Second group
hist(x, main = "Histogram of Sepal.Length")
hist(y, main = "Histogram of Petal.Length")
# Combined plot: overlay the second histogram with a semi-transparent fill
hist(x, xlim = c(0, 8),ylim = c(0, 50), main = "Histogram of Two variables")
hist(y, add = TRUE, col = rgb(1, 0, 0, alpha = 0.5))
par(mfrow = c(1, 2))
x <- iris$Sepal.Length # First group
y <- iris$Petal.Length # Second group
hist(x, probability = TRUE, main = "Histogram of Sepal.Length")
lines(density(x), lwd = 2, col = 'red')
hist(y, probability = TRUE, main = "Histogram of Petal.Length")
lines(density(y), lwd = 2, col = 'red')
x <- iris$Sepal.Length # First group
y <- iris$Petal.Length # Second group
hist(x, ylim = c(0, 0.5), probability = TRUE,
main = "Histogram of Sepal.Length")
x_val = seq(min(x), max(x), length.out = 100)
f_val = dnorm(x_val, mean = mean(x), sd = sd(x))
lines(x_val, f_val, lwd = 2, col = 'red')
Box plots (Chambers 1983) are an excellent tool for detecting and illustrating location and variation changes between different groups of data.
boxplot(x, ylab = "Sepal.Length")
boxplot(x, xlab = "Sepal.Length", horizontal = TRUE)
boxplot(x, xlab = "Sepal.Length", horizontal = TRUE)
stripchart(x, method = "jitter", pch = 19, add = TRUE, col = "red")
IQR = Q3 - Q1
Usual low value, L = Q1 - 1.5*IQR
Usual high value, U = Q3 + 1.5*IQR
Any value outside the range between L and U is considered an outlier.
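A worked example of this rule on a small made-up vector (the values in `v` are illustrative). Note that boxplot() computes the quartiles as hinges, which can differ slightly from quantile() for some sample sizes:

```r
v <- c(2, 4, 4, 5, 6, 7, 8, 9, 50)

q1 <- unname(quantile(v, probs = 0.25))
q3 <- unname(quantile(v, probs = 0.75))
iqr <- q3 - q1                  # lower-case name avoids masking stats::IQR

lower <- q1 - 1.5 * iqr         # usual low value, L
upper <- q3 + 1.5 * iqr         # usual high value, U

# Values outside [lower, upper] are flagged as outliers
outliers <- v[v < lower | v > upper]
outliers
```

Here Q1 = 4 and Q3 = 8, so the limits are -2 and 14 and only the value 50 is flagged.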
Example of outliers
x <- rnorm(50, 20, 5)
x1 <- c(-4, -7, 0, 50, 55) # add few extreme data points
x <- c(x, x1)
boxplot(x)
Manual boundary lines for outliers
Q1 <- quantile(x, probs = 0.25)
Q3 <- quantile(x, probs = 0.75)
IQR <- Q3 - Q1
L <- Q1 - 1.5*(IQR)
U <- Q3 + 1.5*(IQR)
boxplot(x, horizontal = TRUE, main = "Detection of outliers using boxplot")
abline(v = L, col = 'red')
abline(v = U, col = 'blue')
boxplot(Sepal.Length ~ Species, data = iris, col = rainbow(3), horizontal = FALSE)
x = iris$Sepal.Length
y = iris$Petal.Length
plot(x, y, pch = 19, col = "gray52")
# Linear fit
abline(lm(y ~ x), col = "orange", lwd = 3)
# Smooth fit
lines(lowess(x, y), col = "blue", lwd = 3)
# Legend
legend("topleft", legend = c("Linear", "Smooth"),
lwd = 3, lty = c(1, 1), col = c("orange", "blue"))
A scatter plot matrix is a set of pairwise scatter plots that shows the association between each pair of variables.
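The visual impression of pairwise association can be checked numerically with cor(). A complementary sketch, assuming the four numeric columns of the iris data:

```r
# Keep only the numeric measurement columns
num_cols <- iris[, c("Sepal.Length", "Sepal.Width",
                     "Petal.Length", "Petal.Width")]

# Pairwise Pearson correlations, rounded for readability
cor_mat <- round(cor(num_cols), 2)
cor_mat
```

Each off-diagonal entry corresponds to one panel of the pairwise scatter plot.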
#numerical_df <- subset(iris, select = c(Sepal.Length, Sepal.Width,Petal.Length,Petal.Width))
#pairs(numerical_df)
pairs(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)
pairs(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, col = iris$Species, data = iris)
ggplot2 is one of the most widely used packages for data visualization in R.
It builds graphs in layers and divides a plot into three parts: data, aesthetics, and geometry.
For more details see the link [http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html]
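The layered structure can be made explicit by building a plot step by step. A minimal sketch, assuming ggplot2 is installed; the object names `p_base` and `p_layers` are illustrative:

```r
library(ggplot2)

# Data and aesthetics only: draws an empty panel with axes
p_base <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))

# Each + adds a layer or a setting on top of the base object
p_layers <- p_base +
  geom_point(aes(colour = Species)) +                        # geometry layer
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # second layer: linear fit
  labs(title = "Sepal width vs sepal length")

print(p_layers)
```

Because the base object holds no geometry, the same `p_base` can be reused with different geoms.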
# install.packages("ggplot2")
library(dplyr)
library(ggplot2)
iris %>%
ggplot() +
aes(x = Sepal.Length, y = Sepal.Width) +
geom_point(size=2, shape=12)
#geom_point(aes(size=Sepal.Length)) # size of points varies as values
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(colour = Species)) + # Points and color by group
scale_color_discrete("Type") + # Change legend title
xlab("Sepal Length") + # X-axis label
ylab("Sepal Width") + # Y-axis label
theme(axis.line = element_line(colour = "black", # Changes the default theme
linewidth = 0.24))
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point() +
ggtitle("Scatter plot in R") +
theme(plot.title = element_text(hjust=0.5)) + # Center the title
geom_point(aes(color = Species)) + # Points and color by group
#scale_color_discrete("type") + # Change legend title
xlab("Sepal.Length") + # X-axis label
ylab("Sepal.Width") + # Y-axis label
theme(axis.line = element_line(colour = "red", linewidth = 0.5)) # Changes the default theme (xy-axes)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point() +
geom_abline(intercept = 3, slope = 0 ) +
ggtitle("Scatter plot in R") +
theme(plot.title = element_text(hjust=0.5)) + # Center the title
geom_point(aes(colour = Species)) + # Points and color by group
scale_color_discrete("Species") + # Change legend title
xlab("Sepal.Length") + # X-axis label
ylab("Sepal.Width") + # Y-axis label
theme(axis.line = element_line(colour = "black", # Changes the default theme
linewidth = 0.01))
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point() +
ggtitle("Scatter plot in R") +
theme(plot.title = element_text(hjust=0.5)) + # Center the title
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
#theme_void() + #remove background
#theme_classic() +#remove background
geom_point(aes(colour = Species)) + # Points and color by group
#scale_color_discrete("Species") + # Change legend title
xlab("Sepal.Length") + # X-axis label
ylab("Sepal.Width") + # Y-axis label
theme(axis.line = element_line(colour = "black", # Changes the default theme
linewidth = 0.5))
# Change the line type
ggplot(data=iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_line(linetype = "dashed")
# add points
ggplot(data=iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_line(linetype = "solid")+
geom_point()
# Add labels and plot by category
ggplot(data=iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_line(aes(colour = Species))+
geom_point(aes(colour = Species)) + # Points and color by group
#scale_color_discrete("Species") + # Change legend title
xlab("Sepal.Length") + # X-axis label
ylab("Sepal.Width") # Y-axis label
df = mtcars %>%
group_by(cyl)%>%
summarise(count = n())
df
| cyl | count |
|---|---|
| <dbl> | <int> |
| 4 | 11 |
| 6 | 7 |
| 8 | 14 |
# Basic barplot
df$cyl = factor(df$cyl)
ggp <- ggplot(data=df, aes(x = cyl, y = count, fill = cyl) )+
geom_bar(stat="identity", width=0.7) +
theme_minimal()
ggp
# For summarized data, the grouping variable needs to be a factor
ggp <- ggplot(data = df, aes(y = factor(cyl), x = count, fill = cyl)) +
geom_bar(stat="identity", width=0.8) +
theme_minimal()
ggp
# With the whole data set, map only the category (here to y); geom_bar() counts the rows itself
ggp <- ggplot(mtcars, aes(y=factor(cyl), fill = factor(cyl)))+
scale_fill_discrete("cyl") + # Change legend title
geom_bar() +
theme_minimal()
ggp
top_10_nbd = Parking_Enforcement_by_nbd[1:10, ]
ggp <- ggplot(top_10_nbd, aes(y=neighborhood, x = nbd_count_Parking_Enforcement, fill = neighborhood ))+
geom_bar(stat="identity") +
scale_fill_discrete(name = "neighborhood")+ # Legend title for the fill colors
xlab("Parking enforcement count by Neighborhood") + # X-axis label
ylab("Neighborhood") + # Y-axis label
theme_minimal()
ggp
top_10_nbd = Parking_Enforcement_by_nbd[1:10, ]
ggp <- ggplot(top_10_nbd, aes(y=reorder(neighborhood, nbd_count_Parking_Enforcement), x = nbd_count_Parking_Enforcement, fill = neighborhood ))+
geom_bar(stat="identity") +
scale_fill_discrete(name = "neighborhood")+ # Legend title for the fill colors
theme_void()
ggp
df = mtcars %>%
group_by(cyl)%>%
summarise(count = n())
df$cyl = as.factor(df$cyl)
# Basic barplot
ggp <- ggplot(data=df, aes(x ='', y = count, fill = cyl)) +
geom_bar(stat="identity", width=0.7) +
theme_minimal()
ggp
ggp <- ggplot(data=df, aes(x = '', y = count, fill = cyl)) +
geom_bar(stat="identity", width=0.7) +
coord_polar("y", start=0)
ggp
df$perc = round(df$count/sum(df$count),4) *100
ggp <- ggplot(data=df, aes(x = '', y = perc, fill = cyl)) +
geom_col() +
geom_text(aes(label = paste(perc, '%')), color = rep("white", 3),
position = position_stack(vjust = 0.5)) +
coord_polar(theta = "y") +
theme_void()
ggp
#?geom_histogram()
# Basic histogram
ggplot(iris, aes(x=Sepal.Length)) +
geom_histogram(bins = 20)
# Change the width of bins (binwidth takes precedence over bins, so only binwidth is given)
ggplot(iris, aes(x=Sepal.Length)) +
geom_histogram(binwidth=0.3)
# Change colors
p <-ggplot(iris, aes(x=Sepal.Length)) +
geom_histogram(binwidth=0.3, color="black", fill="gray")+
theme_void()
p
# Add mean line
p + geom_vline(aes(xintercept=mean(Sepal.Length)),
color="blue", linetype="dashed", linewidth=1)
# Histogram with density plot
ggplot(iris, aes(Sepal.Length)) +
geom_histogram(aes(y = after_stat(density)), colour="black", fill="white")+
geom_density(alpha=0.5, fill="red") + #transparency parameter
theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Change histogram plot line colors by groups
ggplot(iris, aes(x=Sepal.Length, color=Species)) +
geom_histogram(fill="gray")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Change histogram fill and line colors by groups
ggplot(iris, aes(x=Sepal.Length,fill=Species, color=Species)) +
geom_histogram(position="identity", alpha=0.5)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
### Use facet grids
stat_summary <- iris %>%
group_by(Species) %>%
summarise(mean_SepL = mean(Sepal.Length), median_SepL = median(Sepal.Length))
stat_summary <- data.frame(Species = rep(stat_summary$Species, 2),
stat = c(stat_summary$mean_SepL, stat_summary$median_SepL),
value = rep(c('mean', 'median'), each = 3))
p <- ggplot(iris, aes(x=Sepal.Length))+
geom_histogram(color="black", fill="steelblue")+
facet_grid(Species ~ .) +
geom_vline(data = stat_summary, mapping = aes(xintercept = stat, color = value))
p
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Basic box plot (vertical, because the variable is mapped to y)
p <- ggplot(iris, aes(y=Sepal.Length)) +
geom_boxplot()
p
# Horizontal box plot
p + coord_flip()
#box plot for multiple category
ggplot(iris, aes(x=Sepal.Length, y=Species)) +
geom_boxplot()
# Notched box plot
ggplot(iris, aes(x=Sepal.Length, y=Species)) +
geom_boxplot(notch=TRUE)
# Change outlier, color, shape and size
ggplot(iris, aes(x=Sepal.Length, y=Species)) +
geom_boxplot(outlier.colour="red", outlier.shape=8,
outlier.size=4)
Box plot line colors can be controlled automatically by the levels of the grouping variable:
# Change box plot line colors by groups
p<-ggplot(iris, aes(y=Sepal.Length, x=Species, color = Species)) +
geom_boxplot()
p
# Change box plot colors by groups
p<- ggplot(iris, aes(y=Sepal.Length, x=Species, fill= Species)) +
geom_boxplot()
p
p + coord_flip()